Abstract

This paper extends the work of Gottlob, Lee, and Valiant (PODS 2009) [9],
and considers worst-case bounds for the size of the result Q(D) of a
conjunctive query Q to a database D given an arbitrary set of functional
dependencies. The bounds in [9] are based on a "coloring" of the query variables.
In order to extend the previous bounds to the setting of arbitrary functional 
dependencies, we leverage tools from information theory to formalize the 
original intuition that each color used represents some possible entropy of 
that variable, and bound the maximum possible size increase via a linear pro- 
gram that seeks to maximize how much more entropy is in the result of the 
query than the input. This new view allows us to precisely characterize the 
entropy structure of worst-case instances for conjunctive queries with simple 
functional dependencies (keys), providing new insights into the results of [9].
We extend these results to the case of general functional dependencies, pro- 
viding upper and lower bounds on the worst-case size increase. We identify 
the fundamental connection between the gap in these bounds and a central 
open question in information theory. 

Finally, we show that, while both the upper and lower bounds are given by 
exponentially large linear programs, one can distinguish in polynomial time 
whether the result of a query with an arbitrary set of functional dependencies 
can be any larger than the input database. 

1 Introduction 

In this paper, we are concerned with deriving worst-case size bounds for the result 
of a conjunctive query in terms of the structural properties of the query, and those 
of the input relations. This paper addresses the main open question left by
Gottlob, Lee, and Valiant (PODS 2009) [9], extending size bounds to the case where
the query is applied to a database that has an arbitrary set of general functional
dependencies (as opposed to just 'simple' functional dependencies, namely those
whose left-hand sides consist of a single variable, as was done in [9]).

Conjunctive queries are the most fundamental and most widely used database 
queries, forming the core of relational algebra [5], [15], [?]. Conjunctive queries also
correspond to nonrecursive datalog rules of the form 

\[ R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m), \]

where each $R_i$ is a relation name of the underlying database D, $R_0$ is the
output relation, and each argument $u_i$ is a list of $|u_i|$ variables, where
$|u_i|$ is the arity of the corresponding relation, and where the same variable
can occur multiple times in one or more argument lists. We allow a single relation
$R_i$ to appear several times in the query, thus $m \ge n$. Throughout this paper
we adopt this datalog rule representation for conjunctive queries.

In general, the result of a conjunctive query can be exponentially large in the 
input size. Even in the case of bounded arities, the result can be substantially larger 
than the input relations. In the worst case, the output size is $r^k$, where r is the size
of the largest input relation and k is the arity of the output relation. Queries with
very large outputs are sometimes unavoidable, but in most cases they are either
ill-posed or otherwise undesirable, as they can be disruptive to a multi-user DBMS.
It is thus useful to recognize such queries whenever possible. Obtaining good
worst-case bounds for conjunctive queries is, moreover, relevant to view
management [15] and data integration [14, 15], as well as to data exchange [?], where
data is transferred from a source database to a target database according to schema 
mappings that are specified via conjunctive queries. In this latter context, good 
bounds on the result size of a conjunctive query may be used for estimating the 
amount of data that needs to be materialized at the target site. 

In the area of query optimization, models for predicting the size of the output of 
a conjunctive query based on selectivity indices for relational operators have been 
developed [22], [12], [?]. The selectivity indices are obtained via sampling techniques
(see, e.g., [19], [?]) from existing database instances. Worst-case bounds may be
obtained by setting each selectivity index to 1, thus assuming the maximum
selectivity for each operator. Unfortunately, the resulting bounds are then often trivial
(akin to the above $r^k$ bound).

A new and very interesting characterization of the worst-case output size of join 
queries was very recently developed by Atserias, Grohe, and Marx [3]. Their result
is based on the notion of fractional edge cover [10], and the associated concept of
fractional edge-cover number $\rho^*(Q)$ of a join query Q. In particular, in [10] it was
shown that

\[ |Q(D)| \le \mathrm{rmax}(Q,D)^{\rho^*(Q)}, \tag{1} \]

where rmax(Q, D) represents the size of the largest relation among $R_1, \dots, R_n$ in
D. In [3] it was shown that this bound is essentially tight.

In [9], these results were extended beyond join queries, to general conjunctive
queries (containing projections) and also to the setting in which the input rela- 
tions satisfy simple functional dependencies. This work introduced a new coloring 
scheme for query variables, and, accordingly, the association of a color number 
C(Q) with each query Q. Roughly, a valid coloring assigns a set C(X) of col- 
ors to each query variable X and requires that for each functional dependency 
$XY \to Z$, the colors of Z are contained in the union of the colors of X and Y.
The color number C(Q) of Q is the maximum over all valid colorings of Q of the 
quotient of the number of colors appearing in the output (i.e., head) variables of 
Q by the maximum number of colors appearing in the variables of any input (i.e., 
body) atom of Q. It was shown that for a query Q and database D with a set of 
simple functional dependencies, 

\[ |Q(D)| \le \mathrm{rmax}(Q,D)^{C(Q)}. \]

In this paper, we attempt to extend these results to the case where we have a 
general set of functional dependencies (including compound functional
dependencies of the form $X, Y, Z \to W$). In this setting, while the lower bound given by the
color number holds, we illustrate that the color number no longer provides an up- 
per bound on the worst-case size increase. In fact, we provide a family of instances 
demonstrating that there is a super-constant gap between the true size increase and 
the bound given by the color number. 

In order to provide size bounds in this general setting we require machinery be- 
yond the color number. We use tools from information theory developed to analyze 
the precise interactions of multivariate distributions. In some sense, this approach 
formalizes the original intuition of the coloring scheme — that each color used rep- 
resents some possible entropy of that variable. We construct a linear program with 
entropies as the variables and the exponent of the worst-case size increase as the 
solution. Functional dependencies can be encoded as constraints in the linear pro- 
grams. The difficulty is determining which additional constraints must be added to 
the linear program to ensure that the solution is realizable as a database instance. 

This question, as it turns out, is crucially related to an old and ongoing in- 
vestigation at the heart of information theory: "which entropy structures can be 
instantiated in multivariate distributions?" [20], [24], [23], [?], [?]. We cannot show
that our upper bound is tight in this general setting, and believe that an explicit 
(even exponential-sized) characterization of the worst-case size increase is unlikely 
without significant advances in information theory. 

Nevertheless, the formalism and tools from information theory shed significant
light on the setting in which all functional dependencies are simple, the case
considered in [9]. We revisit the color number, and the tight bounds on the size
increase for queries with simple functional dependencies, providing an alternative 
formulation of the color number as the solution to a linear program whose variables 
are entropies. This formulation allows us to show that the settings for which we 
have tight bounds on the size increase have worst-case instances with particularly 
simple entropy-structures; specifically, all associated mutual information measures 
are nonnegative. 

Finally, while both our upper and lower bounds are given by linear programs 
that have exponentially many variables, we show that we can decide in polynomial 
time whether a query and set of functional dependencies is sparsity-preserving. In 
particular, we can efficiently decide whether the result of a query can be any larger 
than the input database. 

This paper is organized as follows. In Section 2 we state some useful
definitions of database terms, define the coloring scheme and the color number of
a query, and provide definitions of the basic information theory quantities and the
Shannon information inequalities. In Section 3 we identify the connection between
entropy and worst-case instances, and prove our linear programming size bound.
In Section 4 we provide an alternative definition of the color number in terms of
entropies, and identify the simple entropy structure of worst-case instances in the
settings in which we have tight size bounds (the setting with simple functional
dependencies). We leverage this understanding of the entropy structure of these
instances to construct a family of instances that demonstrate a super-constant gap
between our upper and lower bounds. Finally, in Section 5 we show that we can
efficiently decide whether a query and set of functional dependencies can admit 
any size increase. 

2 Preliminaries 

We begin by giving basic definitions pertaining to database theory. We then
define the color number, and state the size bounds of [9]. Finally, we define some
information theoretic quantities, and define the Shannon information inequalities. 

2.1 Database Terminology 

As already stated in the Introduction, a conjunctive query has the form
$R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m)$, where each $u_i$ is a list of (not
necessarily distinct) variables of length $|u_i| = \mathrm{arity}(R_i)$. Each variable occurring
in the query head $R_0(u_0)$ must also occur in the body of the query. The set of all
variables occurring in Q is denoted by var(Q). It is important to recall that a single
relation $R_i$ might appear several times in the query, and thus m could be larger than n.
A finite structure or database $D = (U_D, R_1, \dots, R_k)$ consists of a finite universe $U_D$ and
relations $R_1, \dots, R_k$ over $U_D$. The answer Q(D) of query Q over database D
consists of the structure $(U_D, R_0)$ whose unique relation $R_0$ contains precisely all
tuples $\theta(u_0)$ such that $\theta : \mathrm{var}(Q) \to U_D$ is a substitution such that for each atom
$R_i(u_j)$ appearing in the query body, $\theta(u_j) \in R_i$. For ease of notation, we define
rmax(Q, D) to be the number of tuples in the largest relation among $R_1, \dots, R_n$
in D.

A (simple) attribute of a relation R identifies a column of R. An attribute list 
consists of a list (without repetition) of attributes of a relation R. A compound 
attribute is an attribute list with at least two attributes. A list consisting of a unique 
attribute A is identified with A. The list of all attributes of R is denoted by attr(R). 
If V is a list of attributes of R and $t \in R$ a tuple of R, then the V-value of t, denoted
by t[V], consists of the tuple obtained as the ordered list of all values in the V-positions
of t.

If V and W are (possibly compound) attributes of R, then a functional
dependency (FD) $V \to W$ on relation R expresses that for each $t, t' \in R$, t[V] = t'[V]
implies that t[W] = t'[W]. Thus each functional dependency $V \to W$ is
equivalent to a set containing an FD $V \to A$ for each element A of W. If A and B are
single attributes, then the FD $A \to B$ is called a simple FD. A (possibly
compound) attribute K of R is a key iff $K \to \mathrm{attr}(R)$ holds. Such a key is called a
simple key if K is a simple attribute, otherwise it is called a compound key.¹ An
argument position in an atom that corresponds to a simple key attribute is referred
to as a keyed position.

Definition 2.1. Given a conjunctive query 

\[ Q = R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m), \]

we define chase(Q) to be the result of iteratively performing the following replace- 
ments: 

• Given two atoms $R_i(u_j)$ and $R_i(u_k)$ of the same relation, with the p-th
position a key for relation $R_i$, if the variable at the p-th position of $u_j$ is the same
as the variable at the p-th position of $u_k$, then for each $h \in \{1, \dots, |u_j|\}$ let X
be the variable that occurs at position h in $u_j$. We replace every instance of
X that occurs anywhere in the query by the variable occurring at position
h of $u_k$, and proceed with the updated $u_i$'s. Finally, we remove the term
$R_i(u_j)$ from the conjunctive query.

¹ Note: We do not require compound keys to be minimal.

While the above definition only applies to queries with simple keys, the chase
operator extends to arbitrary functional dependencies; we refer the reader
to [?] for details.

The following fact confirms the intuition that the substitutions in Definition 2.1
do not affect the result of the query.

Fact 2.2 ([16], [2], [?]). For any instance, the result of applying the query chase(Q)
is identical to the output of applying Q. 
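
The replacement rule of Definition 2.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the encoding of atoms as `(relation, variables)` pairs, the 0-based `keys` map, and the name `chase_simple_keys` are assumptions made for the example, and only simple keys are handled.

```python
def chase_simple_keys(head, body, keys):
    """Apply the simple-key chase of Definition 2.1 (illustrative sketch).

    head: tuple of head variables u_0
    body: list of (relation_name, tuple_of_variables) atoms
    keys: dict relation_name -> set of 0-based simple-key positions
          (an assumed encoding, not taken from the paper)
    """
    head = tuple(head)
    body = [(rel, tuple(args)) for rel, args in body]

    def rename(old, new):
        # Replace every occurrence of `old` anywhere in the query by `new`.
        nonlocal head, body
        head = tuple(new if v == old else v for v in head)
        body = [(rel, tuple(new if v == old else v for v in args))
                for rel, args in body]

    while True:
        # Find two atoms of the same relation that agree on a keyed position.
        pair = next(
            ((j, k)
             for j, (rel_j, u_j) in enumerate(body)
             for k, (rel_k, u_k) in enumerate(body)
             if j != k and rel_j == rel_k
             and any(u_j[p] == u_k[p] for p in keys.get(rel_j, ()))),
            None,
        )
        if pair is None:
            return head, body
        j, k = pair
        for h in range(len(body[j][1])):
            # Re-read both atoms at every step, i.e. use the updated u's.
            rename(body[j][1][h], body[k][1][h])
        del body[j]  # atom j now coincides with atom k, so drop it


# R0(X, Y, Z) <- R(X, Y) AND R(X, Z), with position 0 a key of R.
head, body = chase_simple_keys(
    ("X", "Y", "Z"),
    [("R", ("X", "Y")), ("R", ("X", "Z"))],
    {"R": {0}},
)
```

In this example the key forces Y and Z to coincide, so the two atoms collapse into one and the head becomes (X, Z, Z). Each pass of the loop removes one atom, so the procedure terminates.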

2.2 The Color Number 

We restate the definitions from [9] of a valid coloring and the color number
C(Q) of a query, and state the size bounds of [9].

Definition 2.3. Given a conjunctive query 

\[ Q = R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m), \]

and the set of functional dependencies for each input relation, a valid coloring of
Q with c colors is a coloring $C : \mathrm{var}(Q) \to 2^{\{1, \dots, c\}}$ assigning to each variable
$X \in \mathrm{var}(Q)$ a set of colors $C(X) \subseteq \{1, \dots, c\}$, consisting of zero or more colors,
such that the following condition is satisfied:

• For each functional dependency $X_1, \dots, X_k \to Y$,

\[ C(Y) \subseteq \bigcup_{i} C(X_i). \]

Definition 2.4. The color number of a query
$Q = R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m)$,
denoted C(Q), is the maximum over valid colorings of Q of the ratio of
the total number of colors appearing in the output variables $u_0$, to the maximum
number of colors appearing in any given $u_i$, for $i \ge 1$. Formally:

\[ C(Q) := \max_{\text{colorings}} \frac{\left| \bigcup_{X \in u_0} C(X) \right|}{\max_{i \ge 1} \left| \bigcup_{X \in u_i} C(X) \right|}. \]
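
Definitions 2.3 and 2.4 translate directly into a small check. The sketch below is only illustrative: the function names, the dictionary encoding of a coloring, and the `(lhs, rhs)` encoding of functional dependencies are assumptions for the example. It evaluates the ratio for one given coloring, so it yields a lower bound on C(Q), not the maximum itself, and it assumes at least one color appears in some body atom.

```python
from fractions import Fraction


def colors_of(coloring, variables):
    """Union of the color sets of the given variables."""
    return set().union(*(coloring[v] for v in variables))


def is_valid_coloring(coloring, fds):
    """Definition 2.3: for each FD X_1..X_k -> Y, C(Y) lies in the union of the C(X_i).

    fds is a list of (lhs_variables, rhs_variable) pairs (assumed encoding).
    """
    return all(coloring[rhs] <= colors_of(coloring, lhs) for lhs, rhs in fds)


def coloring_ratio(coloring, head, body):
    """Ratio from Definition 2.4 for one coloring; body is a list of variable tuples."""
    most_in_one_atom = max(len(colors_of(coloring, atom)) for atom in body)
    return Fraction(len(colors_of(coloring, head)), most_in_one_atom)


# Triangle query R0(X, Y, Z) <- R1(X, Y) AND R2(Y, Z) AND R3(X, Z).
triangle_body = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
coloring = {"X": {1}, "Y": {2}, "Z": {3}}
ratio = coloring_ratio(coloring, ("X", "Y", "Z"), triangle_body)
```

With no functional dependencies this coloring is valid and gives the ratio 3/2. Adding the dependency X -> Y would make it invalid, since the color of Y does not appear on X.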

The main theorem of [9] is that the color number yields a tight bound on the
worst-case size increase of general conjunctive queries either without functional
dependencies, or with a set of simple functional dependencies (or simple keys).
Formally, the following theorem is proven:

Theorem (Theorem 4.7 from [9]). Given a query
$Q = R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m)$
and set of simple functional dependencies,

\[ |Q(D)| \le \mathrm{rmax}(Q,D)^{C(\mathrm{chase}(Q))}. \]

Furthermore, this bound is essentially tight: for any N > 0, there exists a database
D with $\mathrm{rmax}(Q,D) \le \mathrm{rep}(Q) \cdot N$, and $|Q(D)| = N^{C(Q)}$, where rep(Q) is the
maximum number of times any specific relation $R_i$ appears in Q.

Additionally, it was shown that, in the setting in which general functional
dependencies are given, the color number yields a lower bound. Specifically,

Proposition (Proposition 6.3 from [9]). Given a query
$Q = R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m)$
and set of functional dependencies, there exists an instance D in
which



The proof of the above proposition is via a construction. This construction
provides some insight into the relationship between the colorings of the variables,
and conditional entropies, and we give a simplified proof in the case that m = n in
Appendix A.

2.3 Conditional Entropy and Information Measures

In this section we state the basic definitions of conditional entropy and
information measures, and then state some facts about Shannon and non-Shannon
information inequalities, which will prove useful in the remainder of the paper.

Definition 2.5. For discrete random variables X, Y with respective supports $\mathcal{X}, \mathcal{Y}$,
the conditional entropy of X given Y, denoted by H(X|Y), is given by

\[ H(X|Y) := \sum_{y \in \mathcal{Y}} p(y)\, H(X|Y = y) = - \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log p(x|y). \]

The following fact follows from the above definition:

Fact 2.6. For discrete random variables X, Y with respective supports $\mathcal{X}, \mathcal{Y}$,

\[ H(X,Y) = H(X) + H(Y|X). \]

Definition 2.7. For discrete random variables X, Y, as above, the mutual
information between X and Y is

\[ I(X;Y) := \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \]

The following fact follows from the above definition: 

Fact 2.8. For discrete random variables X, Y as above, 

\[ I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y). \]
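
The identities in Facts 2.6 and 2.8 are easy to confirm numerically. The following sketch uses a small joint distribution chosen purely for illustration; the helper names `marginal`, `entropy`, and `conditional_entropy` are assumptions for the example, and logarithms are taken base 2.

```python
from collections import defaultdict
from math import log2


def marginal(joint, axes):
    """Marginal distribution of the coordinates listed in `axes`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[a] for a in axes)] += p
    return dict(out)


def entropy(dist):
    """Shannon entropy (base 2) of a distribution given as outcome -> probability."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint, target, given):
    """H(target | given), computed directly from the sum in Definition 2.5."""
    p_given = marginal(joint, given)
    p_both = marginal(joint, given + target)
    return -sum(p * log2(p / p_given[key[:len(given)]])
                for key, p in p_both.items() if p > 0)


# An arbitrary example distribution over pairs (x, y).
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

h_x = entropy(marginal(joint, (0,)))
h_y = entropy(marginal(joint, (1,)))
h_xy = entropy(joint)
h_y_given_x = conditional_entropy(joint, (1,), (0,))
h_x_given_y = conditional_entropy(joint, (0,), (1,))
mutual_info = h_x + h_y - h_xy
```

Here `h_xy` agrees with `h_x + h_y_given_x` (Fact 2.6), and `mutual_info` agrees with `h_x - h_x_given_y` (Fact 2.8), up to floating-point error.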

Definition 2.9. For discrete random variables $X_1, \dots, X_n$ with respective
supports $\mathcal{X}_1, \dots, \mathcal{X}_n$, and $n \ge 3$, we recursively define their mutual information as

\[ I(X_1; \dots; X_n) = I(X_1; \dots; X_{n-1}) - I(X_1; \dots; X_{n-1} | X_n), \]

where the conditional mutual information is defined as

\[ I(X_1; \dots; X_{n-1} | X_n) = \sum_{x_n \in \mathcal{X}_n} p(x_n)\, I(X_1; \dots; X_{n-1} | X_n = x_n), \]

and where for n = 2, mutual information is as defined in Definition 2.7.
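
The recursion of Definition 2.9 can be followed literally in code. The sketch below is illustrative only (the function names and the dictionary encoding of a distribution are assumptions); it evaluates the three-way mutual information of two independent fair bits and their XOR, a standard example in which the quantity is negative.

```python
from collections import defaultdict
from math import log2


def marginal(joint, axes):
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[a] for a in axes)] += p
    return dict(out)


def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def mutual_information(joint, axes):
    """I(X_a1; ...; X_an) for n >= 2, following the recursion of Definition 2.9."""
    if len(axes) == 2:
        a, b = axes
        return (entropy(marginal(joint, (a,)))
                + entropy(marginal(joint, (b,)))
                - entropy(marginal(joint, (a, b))))
    *rest, last = axes
    conditional = 0.0
    for (value,), p in marginal(joint, (last,)).items():
        if p > 0:
            # Distribution of all coordinates conditioned on X_last = value.
            cond = {o: q / p for o, q in joint.items() if o[last] == value}
            conditional += p * mutual_information(cond, tuple(rest))
    return mutual_information(joint, tuple(rest)) - conditional


# X and Y are independent fair bits and Z = X XOR Y.
xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
i_xyz = mutual_information(xor, (0, 1, 2))
```

For this distribution I(X;Y) = 0 while I(X;Y|Z) = 1, so the recursion gives I(X;Y;Z) = -1.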

Unsurprisingly, the above information measures have a set-theoretic structure, 
and can be represented in an information diagram, from which basic relations
between information measures can be easily read off. Figure 1 illustrates a general
information diagram for three variables. The following facts follow from the
previous definitions, and can easily be seen by considering the associated information
diagram. (We refer the reader to Chapter 3 of [23] for proofs of these facts and
a rigorous definition of the set-theoretic structure of information measures.)

Fact 2.10. For discrete random variables $X_1, \dots, X_n$, and any disjoint sets $K, K' \subseteq [n]$:

\[ H(X_K | X_{K'}) = \sum_{S \,:\, S \cap K \ne \emptyset,\ S \cap K' = \emptyset} I(S | X_{[n]-S}), \]

\[ I(K | X_{K'}) = \sum_{S \,:\, S \supseteq K,\ S \cap K' = \emptyset} I(S | X_{[n]-S}), \]

where $I(S | X_{S'})$ denotes $I(X_1; \dots; X_j | X_{S'})$, for S = [j]. Note that we avoid
the notation $I(X_S | X_{S'})$, which has the interpretation of
$I(X_1, \dots, X_j | X_{S'}) = H(X_S | X_{S'})$.

Figure 1: The generic information diagram of X, Y, Z. Note that the set-theoretic
properties of these information measures allow various information equalities
to be read off from such a diagram; for example, I(X; Y) = I(X; Y; Z) +
I(X; Y|Z), and H(Z) = I(X; Y; Z) + I(X; Z|Y) + I(Y; Z|X) + H(Z|X, Y).



We now define the basic information inequalities. 

Definition 2.11. For discrete random variables $X_1, \dots, X_n$ as above, and for a
subset $K \subseteq [n]$, denoting by $X_K$ the tuple of all $X_i$ for $i \in K$, the Shannon
information inequalities consist of all inequalities of the form

\[ H(X_i | X_{[n]-\{i\}}) \ge 0, \]

for all $i \in [n]$, and

\[ I(X_i; X_j | X_K) \ge 0, \]

for all $i \ne j \in [n]$ and $K \subseteq [n] - \{i, j\}$.

We note that, as above, the mutual information expressions can be reexpressed
in terms of entropies. For example,

\[ I(X_i; X_j | X_K) = H(X_i | X_K) - H(X_i | X_j, X_K) = H(X_i, X_K) + H(X_j, X_K) - H(X_K) - H(X_i, X_j, X_K). \]

(See [?], Chapter 14,
for further discussion of the Shannon inequalities.)
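
Written in terms of unconditional entropies, the inequalities of Definition 2.11 can be checked mechanically for any candidate assignment of values to subsets. The following is a minimal sketch under assumed conventions (variables are indexed 0..n-1, `h` maps each `frozenset` of indices to a number, and the empty set maps to 0); the function names are hypothetical.

```python
from itertools import combinations


def subsets(items):
    """All subsets of `items`, as frozensets."""
    items = list(items)
    for size in range(len(items) + 1):
        for chosen in combinations(items, size):
            yield frozenset(chosen)


def violated_shannon_inequalities(h, n, tol=1e-9):
    """Return the inequalities of Definition 2.11 that the set function h violates."""
    full = frozenset(range(n))
    violated = []
    for i in range(n):
        # H(X_i | X_{[n]-{i}}) = h([n]) - h([n]-{i}) >= 0
        if h[full] - h[full - {i}] < -tol:
            violated.append(("H", i))
    for i, j in combinations(range(n), 2):
        for K in subsets(full - {i, j}):
            # I(X_i; X_j | X_K) = h(K+i) + h(K+j) - h(K) - h(K+i+j) >= 0
            if h[K | {i}] + h[K | {j}] - h[K] - h[K | {i, j}] < -tol:
                violated.append(("I", i, j, tuple(sorted(K))))
    return violated


# h(S) = min(|S|, 2): the entropies of two independent fair bits and their XOR.
xor_like = {S: min(len(S), 2) for S in subsets(range(3))}
# h(S) = |S|^2 is not submodular, so it cannot be an entropy function.
not_entropic = {S: len(S) ** 2 for S in subsets(range(3))}
```

The first function passes every check, while the second fails, for instance, the inequality I(X_0; X_1) >= 0. Passing these checks is necessary but, as discussed next, not sufficient for a function to arise from an actual distribution.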

The Shannon information inequalities are well-understood and were, initially, 
hypothesized to essentially capture the space of valid entropy configurations.
However, in a breakthrough work in 1998, Zhang and Yeung showed that there are
fundamental constraints on this space that are not captured by the Shannon
inequalities, even for as few as four random variables [25]. This accounts for the lack of
tightness in our upper bound.

3 Size Bounds



We begin by giving our linear programming upper bound for the worst-case
size increase. Throughout this section, we admit a slight abuse of notation, and
refer to the entropy of a set of attributes of a database, interpreted in the natural way:
given a database table with attribute set $A = \{X_1, \dots, X_k\}$, some fixed
probability distribution $\mathcal{D}$ over the tuples of the table, and two subsets $S, S' \subseteq A$, we
refer to the conditional entropy $H_{\mathcal{D}}(S | S')$, where S and S' are interpreted
as the discrete random variables whose possible values are the |S|-tuples and
|S'|-tuples, respectively, of values that the corresponding variables take in the tuples
of the database table, with probabilities given according to $\mathcal{D}$.

Theorem 3.1. Given a query
$Q = \mathrm{chase}(Q) = R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m)$,
with $\mathrm{var}(Q) = \{X_1, \dots, X_k\}$, and a set of arbitrary functional
dependencies, for any database D,

\[ |Q(D)| \le \mathrm{rmax}(Q,D)^{s(Q)}, \]

where rmax(Q, D) is the size of the largest relation among $R_1, \dots, R_n$ in D, and
s(Q) is the solution to the following linear program:

\begin{align*}
\text{maximize} \quad & h(u_0) \\
\text{subject to} \quad & h(u_i) \le 1 && \forall i \ge 1 \\
& h(x_t | x_{i_1}, \dots, x_{i_j}) = 0 && \text{for each f.d. } X_{i_1}, \dots, X_{i_j} \to X_t \\
& h(x_i | x_{[k]-\{i\}}) \ge 0 && \forall i \in [k] \\
& I(x_i; x_j | x_S) \ge 0 && \forall i, j \in [k] \text{ and } S \subseteq [k] - \{i, j\},
\end{align*}

where the variables of the linear program are the (unconditional) entropies $h(x_S)$
for all $S \subseteq [k]$, and the expressions involving mutual information or conditional
entropies appearing in the constraints are implicitly considered to stand in for the
corresponding linear expressions of these variables (as described in Section 2.3).

Proof. The first step in the proof is to establish the connection between entropy
and worst-case size increases. Given our query Q and database D, let c be such
that $|Q(D)| = \mathrm{rmax}(Q,D)^c$. Let
$Q' = R'_0(\mathrm{var}(Q)) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m)$
be the query derived from Q by including all query variables in the output, and
define the distribution $\mathcal{D}$ over the tuples of Q'(D) to be such that the marginal
distribution $\mathcal{D}_{u_0}$ over the values of the $|u_0|$-tuples corresponding to variables in
$u_0$ is the uniform distribution. Note that such a choice for $\mathcal{D}$ is not necessarily
unique, unless $u_0 = \mathrm{var}(Q)$. Let $H_{\mathcal{D}}(u_i)$ denote the entropy of the projection of
the distribution $\mathcal{D}$ onto the positions labeled by the variables of $u_i$. Observe that
for any $i \in [m]$,

\[ \frac{H_{\mathcal{D}}(u_0)}{H_{\mathcal{D}}(u_i)} \ge \frac{H_{\mathcal{D}}(u_0)}{H_{\mathrm{unif}_i}(u_i)} = \frac{\log(|Q(D)|)}{\log(|R_i(D)|)} \ge c, \tag{2} \]

where $\mathrm{unif}_i$ is the uniform distribution over the tuples of $R_i(D)$. This provides the
motivation for the form of our linear program: maximizing the entropy of $u_0$ while
bounding the entropies of each $u_i$.

To see that the value of the above linear program provides an upper bound on
$\log(|Q(D)|) / \log(|R_i(D)|)$, note that for any set $S \subseteq [k]$, the quantity
$H_{\mathcal{D}}(S) / \max_{i \ge 1} H_{\mathcal{D}}(u_i)$ satisfies all the
constraints that the corresponding variable h(S) is subject to in the linear program,
including the last two sets of constraints that represent the Shannon information
inequalities, and thus by Equation (2) the value of the solution to the linear program
must be at least $\log(|Q(D)|) / \log(|R_i(D)|)$. □
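
As a small worked instance of the linear program of Theorem 3.1 (an illustration added here, not an example from the paper), consider the triangle query $R_0(X,Y,Z) \leftarrow R_1(X,Y) \wedge R_2(Y,Z) \wedge R_3(X,Z)$ with no functional dependencies. Two Shannon-type inequalities already pin down the optimum:

```latex
% Illustrative instance: triangle query, no functional dependencies.
\begin{align*}
\text{maximize}\quad & h(xyz) \\
\text{subject to}\quad & h(xy) \le 1,\quad h(yz) \le 1,\quad h(xz) \le 1, \\
 & \text{and the Shannon inequalities on } h. \\[4pt]
I(x; z \mid y) \ge 0 \;&\Longrightarrow\; h(xy) + h(yz) \ge h(y) + h(xyz), \\
I(x; y) + I(y; z \mid x) \ge 0 \;&\Longrightarrow\; h(y) + h(xz) \ge h(xyz), \\
\text{hence}\quad & 2\,h(xyz) \le h(xy) + h(yz) + h(xz) \le 3 .
\end{align*}
% The point h(x_S) = |S|/2 satisfies every constraint and attains h(xyz) = 3/2.
```

So s(Q) = 3/2 for this query: the assignment $h(x_S) = |S|/2$ is feasible (all conditional entropies equal 1/2 and all pairwise conditional mutual informations vanish) and meets the upper bound derived above.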

In order to make the size bound given by the solution to the linear program of
Theorem 3.1 tight, we would need to add additional constraints so as to enforce the
non-Shannon information inequalities. Unfortunately, it was recently shown that
even for just four variables, there are infinitely many independent such
inequalities [?].

We note that the jump in difficulty of establishing tight size bounds occurs 
when the left-hand sides of functional dependencies go from having single vari- 
ables, to having 2 variables. It is not hard to show that any size bounds for the case 
where functional dependencies have left-hand sides with at most two variables can 
be extended to work for arbitrary functional dependencies, via the following propo- 
sition. 

Proposition 3.2. Given a query Q = chase(Q) and set of functional dependen- 
cies, there exists a query Q' with the following properties: 

• each functional dependency of Q' has at most two variables on its left-hand 
side, 

• Q' = chase(Q'), 

• the set of functional dependencies of Q' is at most polynomially larger than 
that of Q, 

• the description of Q' is at most polynomially larger than that of Q, 

• the worst-case size increases of Q and Q' are identical,

• C(Q) = C(Q').

Proof. We shall iteratively remove functional dependencies from Q that have 3 or
more variables occurring on their left-hand sides, via the addition of a polynomial
number of additional variables, relations, and functional dependencies.

Given a functional dependency $X_1 \dots X_k \to Y$, we add a relation $R'(X_1 X_2 Z)$,
with the new variable Z, together with the functional dependencies $X_1 X_2 \to Z$,
$Z \to X_1$, $Z \to X_2$. We then add the relation $R''(Z X_3 \dots X_k Y)$, together with
the functional dependency $Z X_3 \dots X_k \to Y$. Finally, we remove the functional
dependency $X_1 \dots X_k \to Y$ from the set of functional dependencies.

Iteratively applying the above procedure until there are no more functional de- 
pendencies (other than implied ones) with more than two variables on their left- 
hand sides clearly results in a query Q' with at most a polynomially longer de- 
scription, and polynomially more functional dependencies. Additionally, since all 
new relations are distinct, and all original functional dependencies are implied by 
the new set of functional dependencies, chase(Q') = Q'. To see that the size
increase of Q' is the same as that of Q, note that after each single iteration of the above
procedure, the size increase must remain unchanged, as the values taken by
variables $X_1, X_2$ dictate that taken by Z, and vice versa, defining a 1:1 mapping
between tuples of Q(D) and tuples of the result of the query generated after one
step of the procedure. To conclude, there is a natural mapping between valid
colorings of Q and those of the query obtained after one step of the above procedure, namely
$C(Z) = C(X_1) \cup C(X_2)$. □
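
The rewriting step used in the proof of Proposition 3.2 can be sketched as follows. This is an illustrative sketch only: the encoding of a dependency as `(lhs_tuple, rhs)`, the function name `binarize_fds`, and the fresh variable names `Z1, Z2, ...` are assumptions, and the code assumes those names do not clash with variables already in the query.

```python
from itertools import count


def binarize_fds(fds):
    """Rewrite FDs so that every left-hand side has at most two variables.

    fds: list of (lhs_tuple, rhs_variable) pairs.
    Returns (new_fds, new_atoms), where new_atoms lists the variable tuples of
    the relations added along the way.
    """
    fresh = count(1)
    new_fds, new_atoms = [], []
    work = [(tuple(lhs), rhs) for lhs, rhs in fds]
    while work:
        lhs, rhs = work.pop()
        if len(lhs) <= 2:
            new_fds.append((lhs, rhs))
            continue
        x1, x2, *rest = lhs
        z = f"Z{next(fresh)}"                 # hypothetical fresh variable name
        new_atoms.append((x1, x2, z))         # first new relation, over X1 X2 Z
        new_atoms.append((z, *rest, rhs))     # second new relation, over Z X3 ... Xk Y
        new_fds += [((x1, x2), z), ((z,), x1), ((z,), x2)]
        work.append(((z, *rest), rhs))        # Z X3 ... Xk -> Y replaces the original FD
    return new_fds, new_atoms


fds_out, atoms_out = binarize_fds([(("A", "B", "C", "D"), "Y")])
```

Each pass shortens the left-hand side by one variable and adds three dependencies and two relations, so a dependency with k variables on its left-hand side is eliminated after k - 2 passes.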



4 The Color Number and Entropy 



We now reexamine the color number in an effort to better understand the types 
of entropy structures that it can capture. As the following proposition shows, the 
color number can be defined via the linear program of Theorem 3.1 with the
addition of some extra constraints on the entropies. In particular, we require extra
constraints that enforce that all mutual information measures be nonnegative. (Note
that the Shannon inequalities imply that all mutual information measures of two
variables be nonnegative; however, as Figure 2 depicts, the mutual information of
more than two variables can be negative.) 

Theorem 4.1. Given a query
$Q = \mathrm{chase}(Q) = R_0(u_0) \leftarrow R_1(u_1) \wedge \dots \wedge R_n(u_m)$,
with $\mathrm{var}(Q) = \{X_1, \dots, X_k\}$, and a set of arbitrary functional dependencies,
C(Q) is equal to the solution to the following linear program:

\begin{align*}
\text{maximize} \quad & h(u_0) \\
\text{subject to} \quad & h(u_i) \le 1 && \forall i \ge 1 \\
& h(x_t | x_{i_1}, \dots, x_{i_j}) = 0 && \text{for each f.d. } X_{i_1}, \dots, X_{i_j} \to X_t \\
& I(x_{i_1}; \dots; x_{i_j} | x_{[k]-\{i_1, \dots, i_j\}}) \ge 0 && \forall \text{ sets } \{i_1, \dots, i_j\} = S \subseteq [k],
\end{align*}

where the variables of the linear program are the (unconditional) entropies $h(x_S)$
for all $S \subseteq [k]$, and the expressions involving mutual information or conditional
entropies appearing in the constraints are implicitly considered to stand in for the
corresponding linear expressions of these variables (as described in Section 2.3).

Proof. We first show that given any valid coloring achieving color number C(Q),
we can find a feasible point for the linear program with value C(Q). Given a valid
coloring in which at most r colors occur together in the labels of any input atom,
for every set $S \subseteq [k]$, we set

\[ I(S | x_{[k]-S}) = \frac{\left| \bigcap_{i \in S} C(X_i) - \bigcup_{i \notin S} C(X_i) \right|}{r}, \]

where $I(S | x_{[k]-S})$ denotes $I(x_{i_1}; \dots; x_{i_j} | x_{[k]-S})$, with $S = \{i_1, \dots, i_j\}$. Note
that these $2^k$ mutual information values are sufficient to determine the values of all
variables in the linear program. In particular, these $2^k$ mutual information
measures are the values that would appear in an information diagram. From Fact 2.10,
for any disjoint sets $T, T' \subseteq [k]$, we will now express $I(T | x_{T'})$ in terms of the
color labels. We note that for distinct sets $S_1, S_2$, the corresponding sets of labels
$\bigcap_{i \in S_j} C(X_i) - \bigcup_{i \notin S_j} C(X_i)$ will be disjoint, because these sets consist of exactly
those colors appearing in the labels of each element of $S_j$ and not in any of the
labels of elements not in $S_j$. Thus the sum in Fact 2.10 may be expressed in terms
of the size of the union of these sets for S containing T and disjoint from T'. It
is straightforward to see that this union consists of exactly those colors appearing
in the labels of each element of T and not in any of the labels of elements of T',
yielding:

\[ I(T | x_{T'}) = \frac{\left| \bigcap_{i \in T} C(X_i) - \bigcup_{i \in T'} C(X_i) \right|}{r}. \]

It is now easy to see that this construction yields a feasible point for the linear
program. First observe that all the information inequalities are trivially satisfied,
since for every set $S \subseteq [k]$, $I(S | x_{[k]-S}) \ge 0$ in our construction. To see that the
equality constraints given by the functional dependencies are observed, note that
the dependency $X_1, \dots, X_j \to X_{j+1}$ implies that $C(X_{j+1}) - \bigcup_{i \in [j]} C(X_i) = \emptyset$,
and thus in the above assignment, $I(x_{j+1} | x_{[j]}) = 0$, as desired. (Note that, by
definition, $h(x_{j+1} | x_{[j]}) = I(x_{j+1} | x_{[j]})$.) Finally, to see that the first set of constraints
is observed, note that for any $j \le k$,
$h(x_{[j]}) = \sum_{S \subseteq [k] \,:\, S \cap [j] \ne \emptyset} I(S | x_{[k]-S})$, which,
by our construction, is precisely $\left| \bigcup_{i \in [j]} C(X_i) \right| / r$. The same identity holds
with any index set S in place of [j]; the resulting value is bounded by 1 whenever S
is the index set of an input atom, and it equals C(Q) when S is the index
set of $u_0$, by the definition of the color number.

For the other direction, given a rational feasible point for the linear program
with objective function value v, where all variables have values $r_i/q$, for integers
$r_i, q$, with q being the common denominator, we will construct a valid coloring
certifying that the color number is at least v. The final set of constraints of the LP
implies that for any set $S \subseteq [k]$, $I(S | x_{[k]-S}) = r_S/q \ge 0$. Furthermore, since our
feasible point is rational, $r_S \in \mathbb{N}$. To populate our coloring, we begin with the empty
coloring, and then for each $S \subseteq [k]$, we add $q \cdot I(S | x_{[k]-S})$ unique colors to the
labels of all $X_i$ for which $i \in S$. To see that this coloring obeys the functional
dependencies, note that for $X_1, \dots, X_j \to X_{j+1}$, we have that $I(x_{j+1} | x_{[j]}) = 0$,
and thus by Fact 2.10, for any $S \subseteq [k] - [j]$ such that $j + 1 \in S$, $I(S | x_{[k]-S}) = 0$,
from which it follows that in our construction $C(X_{j+1}) \subseteq \bigcup_{i \in [j]} C(X_i)$. Finally,
to see that the color number is at least the value v of the linear program, note that
by Fact 2.10 a total of

\[ \sum_{S \subseteq [k] \text{ s.t. } S \cap K \ne \emptyset} q \cdot I(S | x_{[k]-S}) = q \cdot h(x_K) \]

unique colors are assigned to each set $X_K$, and thus the color number is at least
$h(u_0)$, as desired. □
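
The first direction of the proof, from a coloring to a point of the linear program, can be made concrete. The sketch below is illustrative and its names are assumptions: `entropies_from_coloring` assigns to each set S of variables the number of colors appearing on S divided by r, and `diagram_atoms` computes the information-diagram entries $I(S | x_{[k]-S})$ as the number of colors appearing on every variable of S and on no other variable, divided by r.

```python
from fractions import Fraction
from itertools import combinations


def _colors(coloring, variables):
    """Union of the color sets of the given variables."""
    return set().union(*(coloring[v] for v in variables))


def _nonempty_subsets(names):
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


def entropies_from_coloring(coloring, body):
    """h(x_S) = |colors appearing on S| / r, with r the most colors in one body atom."""
    r = max(len(_colors(coloring, atom)) for atom in body)
    names = sorted(coloring)
    h = {frozenset(): Fraction(0)}
    for s in _nonempty_subsets(names):
        h[frozenset(s)] = Fraction(len(_colors(coloring, s)), r)
    return h


def diagram_atoms(coloring, body):
    """I(S | rest) = |colors on every variable of S and on none outside S| / r."""
    r = max(len(_colors(coloring, atom)) for atom in body)
    names = sorted(coloring)
    atoms = {}
    for s in _nonempty_subsets(names):
        inside = set.intersection(*(set(coloring[v]) for v in s))
        outside = _colors(coloring, [v for v in names if v not in s])
        atoms[frozenset(s)] = Fraction(len(inside - outside), r)
    return atoms


# Triangle query with one distinct color per variable, so r = 2.
triangle_body = [("X", "Y"), ("Y", "Z"), ("X", "Z")]
coloring = {"X": {1}, "Y": {2}, "Z": {3}}
h = entropies_from_coloring(coloring, triangle_body)
atoms = diagram_atoms(coloring, triangle_body)
```

For this coloring every body atom gets entropy 1, the head gets 3/2, and all diagram entries are nonnegative, as the proof requires. Summing the entries of all sets that meet {X, Y} recovers h for {X, Y}, in line with Fact 2.10.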

Remark 4.2. From the above characterization of the color number, it follows that 
for all the settings in which the color number yields a tight bound on the worst-case 
size increase (i.e. when no functional dependencies are specified, or only simple 
dependencies), there exist worst-case instances whose corresponding information 
diagrams have only nonnegative entries. 

4.1 A Super-Constant Gap

Leveraging the understanding of the entropy structures that are compatible with 
the color number given by the previous theorem, we now show that there is a 
super-constant gap between the exponent of the true worst-case size increase, and 
the color number (in the case of general functional dependencies). We suspect, 
however, that in the majority of practical applications, this gap between the upper 
and lower bounds will be small. 

Theorem 4.3. For any fixed constant a G R, there exists a conjunctive query Q 

and set of functional dependencies, and database D, such that \Q(D)\ > rmax(Q, D^ aC ( chase iQ)) . 
Figure 2: The information diagram of $X_{1,1}, \ldots, X_{4,1}$ in our construction for $k = 4$.
Note that any set of size 2 or more contains all the entropy of all four variables.
The negative mutual information $I(X_{1,1}; X_{2,1}; X_{3,1}; X_{4,1}) = -2$ suggests that no
valid coloring can closely approximate the entropy structure, which is leveraged
in our construction to yield a super-constant gap between the color number and
worst-case size increase.
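
The value $-2$ in the caption can be checked by inclusion-exclusion. The following worked step is a sketch under one assumption: entropies are measured in units of $\log N$, so that within one group $h(X_B) = \min(|B|, 2)$ for every set $B$ of the four variables (any two variables determine the whole group, and a single variable is unconstrained).

```latex
\begin{align*}
I(X_{1,1}; X_{2,1}; X_{3,1}; X_{4,1})
  &= \sum_{\emptyset \neq B \subseteq [4]} (-1)^{|B|+1}\, h(X_B) \\
  &= \binom{4}{1}\cdot 1 \;-\; \binom{4}{2}\cdot 2 \;+\; \binom{4}{3}\cdot 2 \;-\; \binom{4}{4}\cdot 2 \\
  &= 4 - 12 + 8 - 2 \;=\; -2 .
\end{align*}
```

Under the same assumption, each atom shared by exactly three variables has value 1 and all atoms of size one or two vanish, so the atoms sum to $4 \cdot 1 - 2 = 2$, the total entropy of the group.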



Proof. We shall construct a family of queries, and associated databases, whose
color numbers fall short of the true size increase by a super-constant factor.² Fix an
even integer $k$, and consider the following query $Q$ over $k^2/2$ variables $X_{i,j}$, for
$i \in [k]$ and $j \in \{1, \ldots, k/2\}$:

$$Q = R(X_{1,1}, \ldots, X_{i,j}, \ldots, X_{k,k/2}) \leftarrow \bigwedge_{i=1}^{k/2} R_i(X_{1,i}, \ldots, X_{k,i}) \wedge \bigwedge_{i=1}^{k} T_i(X_{i,1}, \ldots, X_{i,k/2}).$$

Additionally, for each $j \in \{1, \ldots, k/2\}$ we impose the following functional
dependencies: given any set $S \subseteq \{X_{1,j}, \ldots, X_{k,j}\}$ with $|S| \geq k/2$, for any $i$,

$$S \to X_{i,j}.$$

Intuitively, the above construction has $k/2$ groups of $k$ variables, such that
amongst any group, any set of $k/2$ of those variables suffices to recover the
remaining $k/2$ variables in that group. The information diagram of one group of
the construction in the case $k = 4$ is depicted in Figure 2. Given any integer
$N$, we will construct a database $D$ such that for all $i \in [k/2]$, $j \in [k]$, we have
$|R_i(D)| = N^{k/2} = |T_j(D)|$. The values assigned to positions labeled by $X_{i,j}$
and $X_{i',j'}$ will be disjoint whenever $j \neq j'$; i.e., the values assigned to each of the
$k/2$ groups are disjoint. Each of the $N^{k/2}$ tuples of $R_i(D)$ will be constructed
so as to be Shamir $(k/2, k)$ secret shares. That is, given the values of any
$k/2$ of the attributes $X_{1,i}, \ldots, X_{k,i}$, the values of the remaining $k/2$ attributes can be
uniquely determined, and for $S \subseteq \{X_{1,i}, \ldots, X_{k,i}\}$,

$$|\pi_S(R_i(D))| = \begin{cases} N^{|S|} & \text{if } |S| < k/2, \\ N^{k/2} & \text{if } |S| \geq k/2. \end{cases}$$

² Our construction is a generalization of a construction suggested to us by Daniel Marx.
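
The projection sizes above can be checked on a small instance. The sketch below is illustrative only and makes assumptions the text does not state: it takes $N = p$ to be a prime larger than $k$, and builds one group's relation as the evaluations $(f(1), \ldots, f(k))$ of all polynomials $f$ over $GF(p)$ of degree less than $k/2$. The names `shamir_relation` and `projection_size` are hypothetical.

```python
from itertools import combinations, product


def shamir_relation(p, t, k):
    """All tuples (f(1), ..., f(k)) for polynomials f over GF(p) of degree < t.

    Assumes p is prime and k < p, so the evaluation points 1..k are distinct.
    """
    assert k < p, "need k distinct nonzero evaluation points in GF(p)"
    rows = set()
    for coeffs in product(range(p), repeat=t):
        rows.add(tuple(
            sum(c * pow(x, e, p) for e, c in enumerate(coeffs)) % p
            for x in range(1, k + 1)
        ))
    return rows


def projection_size(rows, positions):
    """Number of distinct tuples after projecting onto the given positions."""
    return len({tuple(row[i] for i in positions) for row in rows})


p, k = 5, 4          # N = p = 5, one group of k = 4 variables
t = k // 2           # threshold k/2
rows = shamir_relation(p, t, k)

# For each subset size m, collect the projection sizes over all subsets.
sizes = {
    m: {projection_size(rows, s) for s in combinations(range(k), m)}
    for m in range(1, k + 1)
}
print(len(rows), sizes)
```

With these parameters the relation has $5^2 = 25$ tuples, every single attribute takes 5 values, and every set of two or more attributes already has 25 distinct projections, matching the two cases of the formula.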



Since $Q(D)$ consists of the complete join of each $R_i$, $|Q(D)| = (N^{k/2})^{k/2} =
N^{k^2/4}$, whereas the size of the largest input relation is $\mathrm{rmax}(Q, D) = N^{k/2}$. We
now show that $C(\mathit{chase}(Q)) = C(Q) \leq 2$, which will complete our proof of the
theorem.
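
To spell out why a constant bound on the color number suffices, the two sizes can be compared directly. This is a short worked step, assuming $N > 1$ and $\alpha \geq 0$, and taking the bound $C(Q) \leq 2$ (established in the remainder of the proof) as given:

```latex
\begin{align*}
|Q(D)| = N^{k^2/4} &= \left(N^{k/2}\right)^{k/2} = \mathrm{rmax}(Q,D)^{k/2}, \\
\mathrm{rmax}(Q,D)^{\alpha\, C(\mathit{chase}(Q))} &\leq \mathrm{rmax}(Q,D)^{2\alpha}, \\
\text{so}\quad k > 4\alpha \;&\Longrightarrow\; |Q(D)| > \mathrm{rmax}(Q,D)^{\alpha\, C(\mathit{chase}(Q))}.
\end{align*}
```

In other words, the exponent of the true size increase grows like $k/2$ while the color number stays bounded by 2.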

First observe that it suffices to consider the case that for $j \neq j'$,
$\mathcal{L}(X_{i,j}) \cap \mathcal{L}(X_{i',j'}) = \emptyset$, because, assuming otherwise, if a common color $c$ lay in the
intersection, by removing the color $c$ from the labels $\mathcal{L}(X_{i'',j'})$ for all $i''$, we still
have a valid coloring (since there are no functional dependencies between groups),
and the color number could only have increased. Let

$$r_i = \Big|\bigcup_{j=1}^{k} \mathcal{L}(X_{j,i})\Big| \quad\text{and}\quad t_i = \Big|\bigcup_{j=1}^{k/2} \mathcal{L}(X_{i,j})\Big| = \sum_{j=1}^{k/2} |\mathcal{L}(X_{i,j})|$$

denote the number of colors assigned to the variables of each input atom. Thus in
any optimal coloring, we have

$$\Big|\bigcup_{X_{i,j}} \mathcal{L}(X_{i,j})\Big| = \sum_{i=1}^{k/2} \Big|\bigcup_{j=1}^{k} \mathcal{L}(X_{j,i})\Big| = \sum_{i=1}^{k/2} r_i.$$

Next, observe that each element of $\mathcal{L}(X_{i,j})$ must occur in the labels of at least
$k/2$ other variables $X_{i',j}$; if this were not the case, then there would exist a set
$S \subseteq \{X_{1,j}, \ldots, X_{k,j}\}$ of size $|S| \geq k/2$, such that
$\mathcal{L}(X_{i,j}) \not\subseteq \bigcup_{X_{i',j} \in S} \mathcal{L}(X_{i',j})$,
which violates one of the functional dependencies. Thus it follows that

$$\sum_{i=1}^{k} |\mathcal{L}(X_{i,j})| \geq \frac{k}{2}\, r_j.$$

To conclude, putting the above equations together, we have

$$\sum_{i=1}^{k} t_i \geq \frac{k}{2} \sum_{j=1}^{k/2} r_j,$$

and thus there must be at least one $i$ such that
$t_i \geq \frac{(k/2) \sum_{j=1}^{k/2} r_j}{k} = \frac{1}{2} \sum_{j=1}^{k/2} r_j$, and
thus $C(Q) \leq 2$. □
5 Complexity Considerations



From a complexity standpoint, the results of the previous sections are not
encouraging. Both the upper bound and the lower bound, $C(Q)$, are given as the
solutions to exponential-sized linear programs. This prompts the question of
whether one can efficiently determine anything about the size of the result in this
setting with general functional dependencies. (It is shown in [9] that when one only
has simple functional dependencies, tight size bounds can be efficiently computed.)
With general functional dependencies, even computing $\mathit{chase}(Q)$ can be intractable.
Nevertheless, we show that when $\mathit{chase}(Q)$ is given, or can be efficiently computed
(for example, when all the input relations have bounded arities), we can efficiently
decide whether the result of the query with a set of general functional dependencies
can be any larger than the input relations. The proof relies on a proposition
from [9], and then reduces the question at hand to the satisfiability of a sequence
of tractable SAT instances, one for each input relation.

Theorem 5.1. Given a conjunctive query $Q = R_0(u_0) \leftarrow R_1(u_1) \wedge \ldots \wedge R_n(u_n)$
with an arbitrary set of functional dependencies, such that $Q = \mathit{chase}(Q)$, it can
be efficiently decided whether the results of $Q$ can be larger than the input
relations, in which case there exists an instance $D$ with

$$|Q(D)| \geq \left(\frac{\mathrm{rmax}(Q, D)}{\mathit{rep}(Q)}\right)^{n/(n-1)}.$$

The proof of the theorem relies on the following proposition: 

Proposition (Proposition 6.1 from [9]). A query $Q = R_0(u_0) \leftarrow R_1(u_1) \wedge \ldots \wedge
R_n(u_n)$ with arbitrary functional dependencies is sparsity preserving if, and only
if, $C(\mathit{chase}(Q)) = 1$. Equivalently, for any database $D$, $|Q(D)| \leq \mathrm{rmax}(Q, D)$
if, and only if, $C(\mathit{chase}(Q)) = 1$. Furthermore, if $C(\mathit{chase}(Q)) > 1$, then
$C(\mathit{chase}(Q)) \geq \frac{n}{n-1}$.



Proof of Theorem 5.1. By the above proposition, it suffices to show that one can
decide whether $C(Q) > 1$ in polynomial time. First observe that a necessary and
sufficient condition for $C(Q) > 1$ is the existence of some coloring $\mathcal{L}$ such that for
each relation $R_i$, with $i \geq 1$, there is a color $c_i$ such that $c_i \in \bigcup_{X \in u_0} \mathcal{L}(X)$, but
$c_i \notin \bigcup_{X \in u_i} \mathcal{L}(X)$. We will represent this condition as a set of $n$ tractable SAT
expressions, one for each input relation, as follows. Our set of SAT variables will
be $\{x_1, \ldots, x_{|\mathit{var}(Q)|}\}$, in natural correspondence with the set of query variables

$$V = \{X_1, \ldots, X_{|\mathit{var}(Q)|}\}.$$

From Proposition 3.2 it suffices to prove our theorem in the case that all
functional dependencies have at most two variables on their left-hand sides. Given $p$
functional dependencies $X_{j_1} X_{k_1} \to X_{m_1}, \ldots, X_{j_p} X_{k_p} \to X_{m_p}$, our SAT
expression for relation $i$ will have the form

$$\mathit{SAT}_i = \Big(\bigwedge_{X_j \in u_i} \neg x_j\Big) \wedge \Big(\bigvee_{X_j \in u_0} x_j\Big) \wedge (x_{j_1} \vee x_{k_1} \vee \neg x_{m_1}) \wedge \ldots \wedge (x_{j_p} \vee x_{k_p} \vee \neg x_{m_p}).$$

Any satisfying assignment of $\mathit{SAT}_i$ yields a valid coloring of $Q$ that uses exactly 1
color, and has the property that no variable in $u_i$ has a color, but at least one variable
in $u_0$ has a color; such a coloring is given by assigning all variables that are set to
false to not have the color, and all variables set to true to have the color. To see
this, note that the first part of $\mathit{SAT}_i$ ensures that no variable occurring in $u_i$ can be
true in a satisfying assignment; the second part of $\mathit{SAT}_i$ ensures that at least one
variable in the output projection will be colored, and the third part of $\mathit{SAT}_i$ ensures
that the functional dependencies are respected. Since any set of valid colorings can
be combined to yield a valid coloring (by letting $\mathcal{L}_{1,2}(X_i) = \mathcal{L}_1(X_i) \cup \mathcal{L}_2(X_i)$), it
follows that if, for all $i = 1, \ldots, n$, $\mathit{SAT}_i$ is satisfiable, then there exists a coloring
with $n$ colors, yielding $C(Q) \geq \frac{n}{n-1} > 1$. Conversely, if, for some $i$, $\mathit{SAT}_i$ is
not satisfiable, then there is no valid coloring of the variables in which some color
appears in the output projection but not in the coloring of a variable of $u_i$, in which
case $C(Q) = 1$.

What remains is to verify that $\mathit{SAT}_i$ can be solved efficiently. We start by
decomposing $\mathit{SAT}_i$ into its three basic components: $\mathit{SAT}_i = C_1 \wedge C_2 \wedge C_3$, where
$C_1 = \bigwedge_{X_j \in u_i} \neg x_j$, $C_2 = \bigvee_{X_j \in u_0} x_j$, and $C_3 = \bigwedge_{h=1,\ldots,p} (x_{j_h} \vee x_{k_h} \vee \neg x_{m_h})$.
We start by removing all variables from $C_2$ that appear negated in $C_1$. Then, we
simplify $\mathit{SAT}_i$ via a series of at most $|V|$ 'passes'. In each pass, we traverse each
clause $(x_{j_h} \vee x_{k_h} \vee \neg x_{m_h})$ of $C_3$; if $x_{m_h}$ occurs in $C_1$, then we remove the clause
$(x_{j_h} \vee x_{k_h} \vee \neg x_{m_h})$ from $C_3$ and proceed. Otherwise, if either $x_{j_h}$ or $x_{k_h}$ occurs
in $C_1$, we remove the occurring variable(s) from this clause in $C_3$ and proceed.
Finally, if a clause of $C_3$ consists of a single negated literal $\neg x_{m_h}$, we remove that
clause from $C_3$, and add the literal to $C_1$. If no new variable is added to $C_1$ during
a pass, this means that no additional passes will alter the clauses, so we halt.

It is not hard to see that each pass does not alter the satisfiability of the
expression $C_1 \wedge C_2 \wedge C_3$. Furthermore, since each pass either adds at least one variable
to $C_1$, or is the last pass, there will be at most $|V|$ passes. If at any point a clause
in $C_3$ becomes a single literal $x_i$ that also occurs in $C_1$, or $C_2$ consists of a subset
of the variables occurring in $C_1$, then $\mathit{SAT}_i$ is clearly not satisfiable; if this does
not occur, then no additional passes will alter the clauses, and a satisfying
assignment for $\mathit{SAT}_i$ is given by setting all the variables in $C_1$ to be false, and all other
variables to be true. □
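
The passes described in the proof amount to propagating forced-false variables until nothing changes. The following Python sketch illustrates that idea; it is not the paper's exact procedure. It assumes $Q = \mathit{chase}(Q)$, represents each functional dependency as a pair `(lhs, rhs)`, and accepts left-hand sides of any size rather than first reducing to two variables. All function names and the example query are hypothetical.

```python
def forced_uncolored(atom_vars, fds):
    """Variables that must be false in SAT_i for the atom with `atom_vars`.

    Starts from the variables of the atom (the clause set C_1) and repeatedly
    adds the right-hand side of any dependency whose left-hand side is
    entirely false, mirroring the passes over C_3.
    """
    false_vars = set(atom_vars)
    changed = True
    while changed:                      # at most |V| iterations
        changed = False
        for lhs, rhs in fds:
            if rhs not in false_vars and set(lhs) <= false_vars:
                false_vars.add(rhs)
                changed = True
    return false_vars


def sat_i_satisfiable(output_vars, atom_vars, fds):
    """SAT_i is satisfiable iff some output variable is not forced false."""
    return not set(output_vars) <= forced_uncolored(atom_vars, fds)


def color_number_exceeds_one(output_vars, input_atoms, fds):
    """Decide whether C(Q) > 1: every SAT_i must be satisfiable."""
    return all(sat_i_satisfiable(output_vars, a, fds) for a in input_atoms)


# Hypothetical query R0(X, Y, Z) <- R1(X, Y) AND R2(Y, Z).
output_vars = ["X", "Y", "Z"]
input_atoms = [["X", "Y"], ["Y", "Z"]]

without_fds = color_number_exceeds_one(output_vars, input_atoms, [])
with_key = color_number_exceeds_one(output_vars, input_atoms, [(["Y"], "Z")])
print(without_fds, with_key)
```

In the example, the plain join can blow up, while adding the dependency Y -> Z forces every output variable to be uncolored with respect to R1, so the result cannot exceed the largest input relation.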
6 Conclusions



We view the main contribution of this work as establishing a firm connection 
between worst-case size bounds and multivariate entropy structures, allowing the 
tools of information theory to be leveraged towards database analysis. This con- 
nection promotes two main lines of future work. The first direction is investigating 
whether one can explicitly characterize the worst-case size increase, even if that 
characterization is exponentially large. It is also conceivable that, while exactly 
characterizing the size increase might not be possible, one can explicitly (and pos- 
sibly even efficiently) compute an approximation of the worst-case size increase. 
This seems like a deep and challenging question, and such a result would likely 
involve a significant advance in the understanding of the structure of non-Shannon 
type information inequalities. 

The second direction is investigating which types of entropy structures arise 
from databases and their associated queries in practice. Such an investigation 
would help determine where practical instances lie on the spectrum between the 
basic color number bounds and the more intricate bounds of Theorem 3.1. Such
database measures as sparsity and treewidth were introduced with corresponding 
goals in mind, and have proved effective at succinctly capturing the ease with which 
certain database operations can be done. We propose the following measure of the 
entropy structure of a database and associated query, in the hope that it will suc- 
cinctly capture this new facet of database complexity, as suggested by the results 
of this paper: 

Definition 6.1. The knitted complexity of a database with respect to a query is the 
ratio of the sum of the absolute values of the mutual informations of all subsets of 
the query variables, to the sum of the (signed) mutual informations of all subsets 
of the query variables. 
A Simplified Proof of Proposition 6.3 from [9]

For clarity, we state and prove the proposition in the case that each input rela- 
tion occurs only once in the query, and thus Q = chase(Q). 

Proposition A.1. Given a query $Q = R_0(u_0) \leftarrow R_1(u_1) \wedge \ldots \wedge R_n(u_n)$ and set
of functional dependencies, there exists an instance $D$ in which

$$|Q(D)| \geq (\mathrm{rmax}(Q, D))^{C(Q)}.$$

Proof. Given an integer $N$, and any valid coloring with $d$ colors, with $d' \leq d$ colors
appearing in the labels of the output variables, such that the coloring achieves color
number $C(Q)$, we shall construct an instance of $D$ with the property that $|Q(D)| =
N^{d'}$, and $\mathrm{rmax}(Q, D) \leq N^{d'/C(Q)}$.

Consider a table of arity $d$, with attributes $C_1, \ldots, C_d$ corresponding to each
of the $d$ colors. We construct the table $T$ to have $N^d$ tuples, such that the projection
$\pi_{C_{i_1}, \ldots, C_{i_k}}(T)$ of $T$ onto any $k$ attributes $C_{i_1}, \ldots, C_{i_k}$ has size $N^k$. We denote the
$N$ values that a given attribute $C_i$ may take by the values $i_1, \ldots, i_N$. (Thus $T$ is
just the total join of the $d$ columns of size $N$.)

Next, we populate a given relation $R_j$ that has variables $X_1, \ldots, X_k$ in the
corresponding atom $u_j$. Assume, without loss of generality, that in the given coloring
of $Q$, $\bigcup_{i=1}^{k} \mathcal{L}(X_i) = \{1, \ldots, q\}$. We populate $R_j$ with $N^q$ tuples derived from
the $N^q$ tuples in $\pi_{C_1, \ldots, C_q}(T)$, where the values that attribute $X_i$ takes are given
by an ordered list of the values taken by the $C_i$'s that are in $\mathcal{L}(X_i)$. To illustrate,
say $q = 3$, and $(1_\cdot, 2_\cdot, 3_\cdot)$ is a tuple of $\pi_{C_1, \ldots, C_q}(T)$; if $R_j(X, Y)$ appears in $Q$, and
$\mathcal{L}(X) = \{1, 2\}$, $\mathcal{L}(Y) = \{2, 3\}$, then we add the tuple $([1_\cdot, 2_\cdot], [2_\cdot, 3_\cdot])$ to $R_j$, with
the value $[1_\cdot, 2_\cdot]$ appearing in the first attribute of $R_j$. From the definition of valid
coloring, it follows that the constructed database satisfies all functional dependencies.
Additionally, by construction, if all variables appeared in the output, all $N^d$
tuples would appear in the output, and thus $|Q(D)| = N^{d'}$. For each input relation
$R_j$, we have $|R_j(D)| = N^k$, where $k = |\bigcup_{X \in u_j} \mathcal{L}(X)|$, as desired. □
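
The construction in this proof can be tried out on a small query. The sketch below is a minimal illustration, not the paper's code: it assumes each relation occurs once, takes the coloring as given without checking its validity against any functional dependencies, and uses a naive join to evaluate the query. The names `build_instance` and `evaluate`, and the triangle query used as the example, are hypothetical.

```python
from itertools import product


def build_instance(atoms, coloring, n):
    """Populate each input relation from the product table T over all colors.

    A variable's value in a tuple of T is the ordered list of the values of
    the colors in its label, as in the proof above.
    """
    colors = sorted(set().union(*coloring.values()))
    relations = {name: set() for name in atoms}
    for combo in product(range(n), repeat=len(colors)):
        row = dict(zip(colors, combo))          # one tuple of T
        for name, variables in atoms.items():
            relations[name].add(tuple(
                tuple(row[c] for c in sorted(coloring[v])) for v in variables
            ))
    return relations


def evaluate(atoms, relations, output_vars):
    """Naive join: extend partial variable bindings one atom at a time."""
    bindings = [{}]
    for name, variables in atoms.items():
        extended = []
        for binding in bindings:
            for tup in relations[name]:
                candidate = dict(binding)
                if all(candidate.setdefault(v, val) == val
                       for v, val in zip(variables, tup)):
                    extended.append(candidate)
        bindings = extended
    return {tuple(b[v] for v in output_vars) for b in bindings}


# Triangle query with one color per variable: color number 3/2.
atoms = {"R1": ("X", "Y"), "R2": ("Y", "Z"), "R3": ("X", "Z")}
coloring = {"X": {1}, "Y": {2}, "Z": {3}}
n = 2

relations = build_instance(atoms, coloring, n)
result = evaluate(atoms, relations, ("X", "Y", "Z"))
rmax = max(len(r) for r in relations.values())
print(rmax, len(result))
```

Here each input relation has $N^2 = 4$ tuples and the result has $N^3 = 8$ tuples, which equals $\mathrm{rmax}^{3/2}$ for this coloring.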